[CLEAN] Synthetic Benchmark PR #138126 - Fix stats performance #29

tomerqodo · 2025-12-04T20:37:30Z

User description

Benchmark PR elastic#138126

Type: Clean (correct implementation)

Original PR Title: Fix stats performance
Original PR Description: This fixes the N^2 performance problem described in elastic#97222. In addition to restoring the previous partial fix (elastic#130857), it does the following:

IndicesQueryCache::getStats now accepts a Supplier so that IndicesQueryCache::getSharedRamSizeForAllShards is called only when it is actually needed. This fixes an N^2 performance problem introduced by elastic/elasticsearch#130857 ("Improving statsByShard performance when the number of shards is very large"). Before that change, a call to TransportIndicesStatsAction that did not request query cache stats never entered the N^2 loop; the loop ran only when query cache stats were requested. After that change, the N^2 cost was paid on every call. This is a pretty big problem for clusters with large shards since this is called very frequently (including every 30 seconds by a background task).
It fixes the N^2 performance in TransportIndicesStatsAction by sharing state across all shardOperation calls on a single node, using the new NodeContext feature from elastic/elasticsearch#138057 ("Adding NodeContext to TransportBroadcastByNodeAction").
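
The two points above can be illustrated with a minimal, self-contained sketch. It is not the Elasticsearch code: `Memoized`, `scanAllShards`, and `run` are hypothetical names invented for this example, and `Memoized` only stands in for whatever caching wrapper the PR actually uses. The sketch assumes one shared supplier per node that every per-shard operation may consult, and shows that the expensive all-shards scan then runs at most once, and not at all when query cache stats are not requested.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

public class LazyTotalsSketch {

    /** Hypothetical memoizing wrapper: computes the delegate's value on first use only. */
    static final class Memoized<T> implements Supplier<T> {
        private final Supplier<T> delegate;
        private volatile boolean computed;
        private T value;

        Memoized(Supplier<T> delegate) {
            this.delegate = delegate;
        }

        @Override
        public T get() {
            if (!computed) {
                synchronized (this) {
                    if (!computed) {
                        value = delegate.get();
                        computed = true; // volatile write publishes value
                    }
                }
            }
            return value;
        }
    }

    /** Counts how many times the expensive all-shards scan runs. */
    static final AtomicInteger scans = new AtomicInteger();

    /** Hypothetical expensive scan that visits every shard once (O(N)). */
    static long scanAllShards(int shardCount) {
        scans.incrementAndGet();
        long total = 0;
        for (int i = 0; i < shardCount; i++) {
            total += i;
        }
        return total;
    }

    /** Simulates one node handling shardCount shard operations; returns the number of scans. */
    static int run(int shardCount, boolean wantQueryCacheStats) {
        scans.set(0);
        // One shared context per node, created before any shard operation runs.
        Supplier<Long> nodeContext = new Memoized<>(() -> scanAllShards(shardCount));
        for (int shard = 0; shard < shardCount; shard++) {
            if (wantQueryCacheStats) {
                nodeContext.get(); // only the first call triggers the scan
            }
        }
        return scans.get();
    }

    public static void main(String[] args) {
        System.out.println(run(1000, true));  // prints 1
        System.out.println(run(1000, false)); // prints 0
    }
}
```

Without the shared supplier, each of the N shard operations would repeat the O(N) scan, which is the N^2 behavior the description refers to.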

Closes elastic#97222
Original PR URL: elastic#138126

PR Type

Bug fix, Enhancement

Description

Fix N^2 performance problem in stats APIs by caching shared RAM calculations
Introduce CacheTotals record and refactor shared RAM distribution logic
Update IndicesQueryCache.getStats() to accept precomputed shared RAM supplier
Modify TransportIndicesStatsAction to use NodeContext for state sharing across shards
Update TransportClusterStatsAction to cache shared RAM calculations per node
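
As a rough illustration of the "compute totals once, then distribute per shard" idea, here is a hedged sketch. Everything in it is an assumption made for the example: the `Totals` record and its fields, the method names, and in particular the distribution policy (proportional to each shard's cached item count, with an even split when nothing is cached). The PR summary does not spell out the real `CacheTotals` fields or the real policy.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SharedRamSketch {

    /** Hypothetical node-wide totals, computed once (field names are assumptions). */
    record Totals(long totalItems, int shardCount, long sharedRamBytes) {}

    /** Single O(N) pass over the per-shard item counts. */
    static Totals computeTotals(Map<String, Long> itemsPerShard, long sharedRamBytes) {
        long total = 0;
        for (long items : itemsPerShard.values()) {
            total += items;
        }
        return new Totals(total, itemsPerShard.size(), sharedRamBytes);
    }

    /** O(1) per shard once totals exist. Assumed policy: proportional, even split if empty. */
    static long shareForShard(long shardItems, Totals totals) {
        if (totals.sharedRamBytes() == 0 || totals.shardCount() == 0) {
            return 0L;
        }
        if (totals.totalItems() == 0) {
            return totals.sharedRamBytes() / totals.shardCount();
        }
        return Math.round((double) totals.sharedRamBytes() * shardItems / totals.totalItems());
    }

    /** Computes every shard's share with one totals pass plus one O(1) lookup per shard. */
    static Map<String, Long> distribute(Map<String, Long> itemsPerShard, long sharedRamBytes) {
        Totals totals = computeTotals(itemsPerShard, sharedRamBytes);
        Map<String, Long> shares = new LinkedHashMap<>();
        for (Map.Entry<String, Long> entry : itemsPerShard.entrySet()) {
            shares.put(entry.getKey(), shareForShard(entry.getValue(), totals));
        }
        return shares;
    }

    public static void main(String[] args) {
        Map<String, Long> items = new LinkedHashMap<>();
        items.put("shard-a", 1L);
        items.put("shard-b", 3L);
        System.out.println(distribute(items, 400L));
    }
}
```

The design point is that `shareForShard` never iterates over other shards, so asking for every shard's share costs O(N) overall instead of O(N^2).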

Diagram Walkthrough

flowchart LR
  A["Stats API Requests"] -->|"Previously: O(N²) loop"| B["IndicesQueryCache.getStats"]
  A -->|"Now: Cached once per node"| C["CachedSupplier wrapper"]
  C -->|"Computes totals once"| D["getCacheTotalsForAllShards"]
  D -->|"Distributes to shards"| E["getSharedRamSizeForShard"]
  E -->|"Returns precomputed value"| B

File Walkthrough

Relevant files

Bug fix, enhancement

3 files

IndicesQueryCache.java `Refactor shared RAM calculation with CacheTotals record`	+81/-33
TransportIndicesStatsAction.java `Use NodeContext to cache query cache totals across shards`	+19/-3
TransportClusterStatsAction.java `Cache shared RAM calculations using CachedSupplier`	+16/-3

Enhancement

2 files

CommonStats.java `Accept precomputed shared RAM supplier in getShardLevelStats`	+8/-2
IndicesService.java `Pass precomputed shared RAM to indexShardStats method`	+16/-8

Tests

5 files

IndicesQueryCacheTests.java `Update tests to use new getStats supplier parameter`	+129/-82
IndicesServiceCloseTests.java `Update cache stats calls with precomputed shared RAM`	+7/-7
IndicesServiceTests.java `Update mocks to handle new indexShardStats signature`	+10/-4
VersionStatsTests.java `Update getShardLevelStats call with shared RAM supplier`	+1/-1
IndexShardTests.java `Update getShardLevelStats call with shared RAM supplier`	+1/-1

Documentation

1 file

138126.yaml `Add changelog entry for performance improvement`	+6/-0

qodo-code-review · 2025-12-04T20:38:08Z

PR Compliance Guide 🔍

Below is a summary of compliance checks for this PR:

Security Compliance
🟢	No security concerns identified No security vulnerabilities detected by AI analysis. Human verification advised for critical code.
Ticket Compliance
⚪	🎫 No ticket provided Create ticket/issue
Codebase Duplication Compliance
⚪	Codebase context is not defined Follow the guide to enable codebase context checks.
Custom Compliance
🟢	Generic: Meaningful Naming and Self-Documenting Code. Objective: Ensure all identifiers clearly express their purpose and intent, making code self-documenting. Status: Passed
	Generic: Secure Error Handling. Objective: To prevent the leakage of sensitive system information through error messages while providing sufficient detail for internal debugging. Status: Passed
	Generic: Secure Logging Practices. Objective: To ensure logs are useful for debugging and auditing without exposing sensitive information like PII, PHI, or cardholder data. Status: Passed
	Generic: Security-First Input Validation and Data Handling. Objective: Ensure all data inputs are validated, sanitized, and handled securely to prevent vulnerabilities. Status: Passed
⚪	Generic: Comprehensive Audit Trails. Objective: To create a detailed and reliable record of critical system actions for security analysis and compliance. Status: No audit logs: The new performance-related logic (e.g., caching suppliers and node context) introduces critical stats behavior changes without adding any audit logging for accesses or computations that could be considered sensitive operational actions.

Referred Code

    @Override
    protected Supplier<IndicesQueryCache.CacheTotals> createNodeContext() {
        return CachedSupplier.wrap(() -> IndicesQueryCache.getCacheTotalsForAllShards(indicesService));
    }

    @Override
    protected void shardOperation(
        IndicesStatsRequest request,
        ShardRouting shardRouting,
        Task task,
        Supplier<IndicesQueryCache.CacheTotals> context,
        ActionListener<ShardStats> listener
    ) {
        ActionListener.completeWith(listener, () -> {
            assert task instanceof CancellableTask;
            IndexService indexService = indicesService.indexServiceSafe(shardRouting.shardId().getIndex());
            IndexShard indexShard = indexService.getShard(shardRouting.shardId().id());
            CommonStats commonStats = CommonStats.getShardLevelStats(
                indicesService.getIndicesQueryCache(),
                indexShard,
                request.flags(),
    ... (clipped 6 lines)
	Generic: Robust Error Handling and Edge Case Management. Objective: Ensure comprehensive error handling that provides meaningful context and graceful degradation. Status: Null cache handling: New helper methods compute shared RAM and stats across shards but rely on external services and suppliers without explicit error handling or null checks beyond basic ternaries, which may require verification in broader context.

Referred Code

    public static Map<ShardId, Long> getSharedRamSizeForAllShards(IndicesService indicesService) {
        Map<ShardId, Long> shardIdToSharedRam = new HashMap<>();
        IndicesQueryCache.CacheTotals cacheTotals = IndicesQueryCache.getCacheTotalsForAllShards(indicesService);
        for (IndexService indexService : indicesService) {
            for (IndexShard indexShard : indexService) {
                final var queryCache = indicesService.getIndicesQueryCache();
                long sharedRam = (queryCache == null) ? 0L : queryCache.getSharedRamSizeForShard(indexShard.shardId(), cacheTotals);
                // as a size optimization, only store non-zero values in the map
                if (sharedRam > 0L) {
                    shardIdToSharedRam.put(indexShard.shardId(), sharedRam);
                }
            }
        }
        return Collections.unmodifiableMap(shardIdToSharedRam);
    }

    public long getCacheSizeForShard(ShardId shardId) {
        Stats stats = shardStats.get(shardId);
        return stats != null ? stats.cacheSize : 0L;
    }
    ... (clipped 47 lines)

Compliance status legend

🟢 - Fully Compliant
🟡 - Partial Compliant
🔴 - Not Compliant
⚪ - Requires Further Human Verification
🏷️ - Compliance label

qodo-code-review · 2025-12-04T20:39:04Z

PR Code Suggestions ✨

Explore these optional code suggestions:

Category	Suggestion	Impact
General	Hoist query cache retrieval out of loop	Low

Hoist the `indicesService.getIndicesQueryCache()` call out of the nested loop and add an early null check to improve performance.

server/src/main/java/org/elasticsearch/indices/IndicesQueryCache.java [83-97]

     public static Map<ShardId, Long> getSharedRamSizeForAllShards(IndicesService indicesService) {
    +    final var queryCache = indicesService.getIndicesQueryCache();
    +    if (queryCache == null) {
    +        return Collections.emptyMap();
    +    }
         Map<ShardId, Long> shardIdToSharedRam = new HashMap<>();
         IndicesQueryCache.CacheTotals cacheTotals = IndicesQueryCache.getCacheTotalsForAllShards(indicesService);
         for (IndexService indexService : indicesService) {
             for (IndexShard indexShard : indexService) {
    -            final var queryCache = indicesService.getIndicesQueryCache();
    -            long sharedRam = (queryCache == null) ? 0L : queryCache.getSharedRamSizeForShard(indexShard.shardId(), cacheTotals);
    +            long sharedRam = queryCache.getSharedRamSizeForShard(indexShard.shardId(), cacheTotals);
                 // as a size optimization, only store non-zero values in the map
                 if (sharedRam > 0L) {
                     shardIdToSharedRam.put(indexShard.shardId(), sharedRam);
                 }
             }
         }
         return Collections.unmodifiableMap(shardIdToSharedRam);
     }

Suggestion importance [1-10]: 5
Why: The suggestion correctly identifies a performance improvement by hoisting the `getIndicesQueryCache()` call out of a nested loop, which is a valid optimization.
General	Avoid redundant query cache retrieval	Low

Remove the redundant `indicesService.getIndicesQueryCache()` call inside the lambda by capturing and reusing the `queryCache` from the outer scope.

server/src/main/java/org/elasticsearch/action/admin/indices/stats/TransportIndicesStatsAction.java [135-143]

     CommonStats commonStats = CommonStats.getShardLevelStats(
         indicesService.getIndicesQueryCache(),
         indexShard,
         request.flags(),
         () -> {
             final IndicesQueryCache queryCache = indicesService.getIndicesQueryCache();
    -        return (queryCache == null) ? 0L : queryCache.getSharedRamSizeForShard(indexShard.shardId(), context.get());
    +        if (queryCache == null) {
    +            return 0L;
    +        }
    +        return queryCache.getSharedRamSizeForShard(indexShard.shardId(), context.get());
         }
     );

Suggestion importance [1-10]: 4
Why: The suggestion correctly points out a redundant call to `indicesService.getIndicesQueryCache()` inside a lambda and proposes a valid simplification for better code clarity.

Apply changes for benchmark PR

ad23461

qodo-code-review bot added the Review effort 2/5 label Dec 4, 2025
